Classification is one of the most popular and widely used supervised
learning tasks. It categorizes objects into predefined classes based on
prior knowledge. Classification has long been an important research topic
in machine learning and data mining, and different classification methods
have been proposed and applied to various real-world problems.
Unlike unsupervised learning such as clustering, a classifier is typically
trained with labeled data before being used to make predictions, and it
usually achieves higher accuracy than an unsupervised method.

In this chapter, we first define classification and then review several
representative methods. After that, we study in detail the application
of classification to a critical problem in drug discovery, namely
drug-target interaction prediction, chosen because of the challenges in
predicting possible interactions between drugs and targets.


1. Classification 

Classification is the process of finding a model or function that describes and
distinguishes data classes or concepts. It is one of the most important tasks
to which supervised learning is applied. Supervised learning is a machine
learning approach that learns a model or a function with the help of
supervision, i.e., from labeled examples. Besides classification, supervised
learning is also used for regression analysis. The goal of classification is
to predict the class label of an object, whereas the goal of regression is to
learn a function with a continuous output.

The rapid development of technologies such as microarrays, high-throughput
sequencing, genotyping arrays, mass spectrometry, and automated
high-resolution image acquisition has led to a dramatic increase in the
availability of biomedical data. Faced with large amounts of data,
computational methods, which are cheaper and more efficient, have become a
useful complement to traditional experimental methods in many areas of
biomedical research and application. As an important data analysis tool,
classification has been applied to many important tasks in bioinformatics,
including sequence annotation, protein function prediction, protein
structure prediction, gene regulatory network inference, protein-protein
interaction prediction, disease gene identification, and drug-target
interaction prediction. Many of these tasks amount to answering a question
with “yes” or “no”: for example, whether two proteins interact, whether a
protein is an enzyme, or whether a piece of sequence is coding or
non-coding. This type of prediction can be handled directly with binary
classification, where “yes” and “no” are treated as two class labels. It
can also be solved through regression. Instead of directly answering “yes”
or “no”, binomial regression methods produce the likelihood or degree of
being “yes”, from which the final result is obtained by applying a
threshold: “yes” if the likelihood is larger than the threshold, and “no”
otherwise. Next we give a more detailed introduction to several
representative classification methods that are widely used in
computational biology.

In the following subsections, we represent the training data consisting
of n labeled examples or data objects as $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$,
where each $\mathbf{x}_i$ is a p-dimensional vector, i.e.,
$\mathbf{x}_i = [x_{i1} \ldots x_{ip}]^T$, and $y_i$ is its associated class
label.


1.1. k-Nearest Neighbor (k-NN)

k-Nearest Neighbor (k-NN) is an instance-based classification method. In
k-NN, an unlabeled object is assigned to the most common class among its k
nearest neighbors in the training set. To determine the k nearest
neighbors of a given object, the distance or closeness between this object
and all the labeled objects needs to be calculated. The number of neighbors
k is an important parameter: with different values of k, k-NN may produce
different results.

Now we use a simple example to illustrate how labeled data is used in
k-NN to predict the class labels of unlabeled objects. Fig. 1 shows
a simple two-dimensional dataset, which consists of seven labeled
objects belonging to two classes and two unlabeled objects. First, we set
k = 1. In this case, each unlabeled object is assigned to the same class as its
nearest neighbor. Fig. 2 shows the classification result of the two unlabeled
objects with k = 1. Since the nearest neighbor of the first unlabeled object,
i.e., the one located in the lower left corner, is labeled as class 1,
this object is also labeled as class 1. Similarly, since the nearest neighbor
of the other unlabeled object is labeled as class 2, this object is predicted
as class 2. When k > 1, the neighbors of an unlabeled object may have
different class labels, and in such cases the unlabeled object
is typically assigned to the most common class among its neighbors. Fig. 3
shows the classification result of k-NN with k = 3. The object in the lower
left corner is still labeled as class 1, as all three of its nearest
neighbors are in this class. However, the other object is now labeled as class
1, as two of its three nearest neighbors belong to class 1, although its
nearest neighbor belongs to class 2. Here, once k is decided, all the
neighbors are considered equally important in deciding the class of the
unlabeled object. An alternative is to assign different weights to the
neighbors so that the votes of the k neighbors carry different levels of
significance.
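
The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the figures: the seven labeled points and the two query points are hypothetical coordinates chosen so that the outcome mirrors the example (class 2 for k = 1, class 1 for k = 3), and the Euclidean distance is an assumed choice of closeness measure.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority class among the k training points nearest to query.

    train is a list of (point, label) pairs; points are tuples of floats.
    """
    # Rank the labeled points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Unweighted majority vote among the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: four points of class 1 and three points of class 2.
train = [((1.0, 1.0), 1), ((1.5, 2.0), 1), ((3.0, 3.0), 1), ((3.0, 4.0), 1),
         ((3.9, 3.6), 2), ((5.0, 5.0), 2), ((5.5, 4.5), 2)]

lower_left = (0.5, 0.5)   # hypothetical unlabeled object
other = (3.5, 3.5)        # hypothetical unlabeled object

for k in (1, 3):
    print(k, knn_predict(train, lower_left, k), knn_predict(train, other, k))
```

A weighted variant would replace the `Counter` update with a sum of weights such as the inverse distance of each neighbor.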

Fig. 2. Classification result of k-NN with k = 1.

Fig. 3. Classification result of k-NN with k = 3.

1.2. Support Vector Machine

The classic Support Vector Machine (SVM) is a linear binary classifier.
Given a p-dimensional dataset where the training samples belong to two
classes, the goal of a linear classifier is to find a (p-1)-dimensional
hyperplane that separates the samples of the two classes, as illustrated in
Fig. 4. Among the many hyperplanes of this kind, the one that maximizes the
separation or margin between the two classes is of most interest, and the
corresponding classifier is called the maximum margin classifier. In SVM,
the margin is the distance from the hyperplane to the nearest samples in
each of the classes. Samples located on the boundary of each class are
called support vectors.

1.2.1. Linear SVM 

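Now we formally define the linear SVM. A set of n training samples,
where each sample is a p-dimensional vector with label -1 or 1, can be
represented as

$$D = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n},$$

where the vectors $\mathbf{x}_i$ carry the data and the $y_i$ carry the label
information. Assume that the dataset is linearly separable; then there exist
$\mathbf{w}$ and $b$ such that the following inequalities hold for all
$\mathbf{x}_i \in D$:

$$\mathbf{w} \cdot \mathbf{x}_i - b \geq 1 \quad \text{if } y_i = 1 \tag{1}$$

$$\mathbf{w} \cdot \mathbf{x}_i - b \leq -1 \quad \text{if } y_i = -1 \tag{2}$$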

Fig. 4. Example of linearly separable dataset in a two dimensional space. An optimal 
hyperplane is the one that maximizes the distance between two classes. 

The above two inequalities can be combined into one:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 \tag{3}$$

Among the training samples, vectors for which

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) = 1 \tag{4}$$

are called support vectors, which define the boundary of the two classes.

The distance or margin between the two classes is $\frac{2}{\|\mathbf{w}\|}$.
The goal is to find the optimal hyperplane, i.e., to decide $\mathbf{w}$ and
$b$ that maximize this margin subject to (3), which requires all the training
samples to be correctly classified. Since maximizing $\frac{2}{\|\mathbf{w}\|}$
is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, we can solve the
above maximization problem by solving the equivalent minimization problem

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \tag{5}$$

subject to

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 \quad \text{for } i = 1, 2, \ldots, n. \tag{6}$$

This constrained optimization problem can be solved with the method
of Lagrange multipliers. By introducing Lagrange multipliers $\alpha_i$, the
Lagrangian is constructed as

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big(y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1\big) \tag{7}$$

which can be solved by standard quadratic programming techniques. According
to the Karush-Kuhn-Tucker conditions, the solution for $\mathbf{w}$ has the
form

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \tag{8}$$

The above formula shows that $\mathbf{w}$ is a linear combination of the
training samples. When $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) = 1$,
$\alpha_i > 0$; in all other cases, $\alpha_i = 0$. This means that
$\mathbf{w}$ is defined only by a small number of support vectors, i.e.,
the training samples located at the boundary of the classes, rather than by
all the training samples.
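
As a small illustration of constraints (3) and (4), the following Python sketch checks a candidate hyperplane against a toy dataset, lists the samples that satisfy (4) with equality, and evaluates the margin $2/\|\mathbf{w}\|$. The data, the values of `w` and `b`, and the function names are hypothetical; the sketch only verifies a given hyperplane and does not solve the quadratic program.

```python
import math

def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i - b) for every training sample."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b)
            for xi, yi in zip(X, y)]

def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices of samples that satisfy constraint (3) with equality, as in (4)."""
    return [i for i, m in enumerate(functional_margins(w, b, X, y))
            if abs(m - 1.0) <= tol]

# Hypothetical two-class toy data separated by the line x1 = 2.
X = [(0.0, 1.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
y = [-1, -1, 1, 1]
w, b = (1.0, 0.0), 2.0                      # candidate hyperplane x1 - 2 = 0

margins = functional_margins(w, b, X, y)
feasible = all(m >= 1.0 for m in margins)   # constraint (3) for all samples
sv = support_vector_indices(w, b, X, y)     # samples on the class boundaries
margin = 2.0 / math.hypot(*w)               # distance between the two classes

print(margins, feasible, sv, margin)
```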

In the above formulation, we assume that the dataset is linearly separable,
i.e., that there exists a hyperplane that divides the samples according to
their class labels without any classification error. When no such
hyperplane exists, we may want to find a hyperplane that correctly divides
as many samples as possible. This is called the soft margin method. Slack
variables $\xi_i \geq 0$ are introduced to formulate this idea. The
constraints now become

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 - \xi_i \tag{9}$$

Since a larger $\xi_i$ corresponds to a larger error in the classification of
$\mathbf{x}_i$, we want to penalize large $\xi_i$ by minimizing the objective
function

$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i \tag{10}$$

where C is the weight parameter of the penalty term. With Lagrange
multipliers $\alpha_i \geq 0$ and $\beta_i \geq 0$, the problem to be solved
is written as

$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \big(y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i\big) - \sum_{i=1}^{n} \beta_i \xi_i \tag{11}$$
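
To make the role of the slack variables concrete, the sketch below evaluates the soft margin objective (10) for a fixed hyperplane. For a given $\mathbf{w}$ and $b$, the smallest slack that satisfies (9) is $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b))$; the toy data, the hyperplane, and the values of `C` are hypothetical.

```python
def slack_variables(w, b, X, y):
    """Smallest xi_i >= 0 satisfying constraint (9) for a fixed w and b."""
    return [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b))
            for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """Objective (10): 0.5 * ||w||^2 + C * sum of the slack variables."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slack_variables(w, b, X, y))

# Hypothetical data: the last sample lies on the wrong side of x1 = 2.
X = [(0.0, 1.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0), (2.5, 0.5)]
y = [-1, -1, 1, 1, -1]
w, b = (1.0, 0.0), 2.0

print(slack_variables(w, b, X, y))
print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

A larger C makes the misclassified sample more expensive, which pushes the optimizer toward hyperplanes with fewer violations at the cost of a smaller margin.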


1.2.2. Kernel SVM 

In many cases, the data is not linearly separable. As illustrated in Fig. 5,
mapping the original space into a high- or infinite-dimensional feature
space, i.e., $\mathbf{x} \rightarrow \phi(\mathbf{x})$, may make the data
easier to separate. Kernel-based approaches use a kernel function $\kappa$
to calculate the inner product of the vectors in the high-dimensional space
in terms of the vectors in the original space:

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \tag{12}$$

March 13, 2015 1:18 World Scientific Review Volume - 9in x 6in

Fig. 5. A non-linearly separable dataset becomes linearly separable after the
mapping $\phi$.

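As $\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w}$, substituting (8) into (7)
shows that the dual of the SVM is the following optimization problem:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \tag{13}$$

subject to

$$\alpha_i \geq 0 \quad \text{for } i = 1, 2, \ldots, n, \tag{14}$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0. \tag{15}$$

By substituting $\mathbf{x}_i$ with $\phi(\mathbf{x}_i)$ in the above formula,
we get the objective function in the mapped space, and with the kernel
function given in (12) we obtain the following form without defining the
mapping explicitly:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \tag{16}$$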

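Below are three commonly used kernels:

• Polynomial kernel

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d \tag{17}$$

where d is the degree of the polynomial.

• Gaussian kernel

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2} \tag{18}$$

where $\sigma$ is the width of the kernel.

• Hyperbolic tangent kernel

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \tanh(h\,\mathbf{x}_i \cdot \mathbf{x}_j + c) \tag{19}$$

where h is the scale factor and c is the offset.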

Since any positive-definite matrix can be treated as a kernel matrix, kernel
SVM can be used to make predictions based on a similarity matrix, which
records pairwise similarities between objects. To make sure that kernel SVM
performs stably, some preprocessing is needed if the given similarity matrix
is not positive-definite.
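
The three kernels can be written down directly, and a small check makes the idea behind (12) tangible. The sketch below is illustrative only: the parameter names (`degree`, `sigma`, `h`, `c`) and the $2\sigma^2$ parameterization of the Gaussian kernel are assumptions, and `phi_poly2` is a hypothetical explicit feature map for one-dimensional inputs, used to show that the degree-2 polynomial kernel equals an inner product in the mapped space.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2):
    return (dot(x, z) + 1.0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed parameterization: exp(-||x - z||^2 / (2 * sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def tanh_kernel(x, z, h=1.0, c=0.0):
    return math.tanh(h * dot(x, z) + c)

def gram_matrix(points, kernel):
    """Kernel (similarity) matrix with entries kernel(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def phi_poly2(x):
    """Explicit map for 1-D input with (x*z + 1)^2 = phi(x) . phi(z)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0], 1.0)

x, z = (1.5,), (-0.5,)
print(polynomial_kernel(x, z), dot(phi_poly2(x), phi_poly2(z)))

K = gram_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)], gaussian_kernel)
print(K)
```

The kernel value is computed from the original vectors only, which is what allows (16) to be optimized without ever constructing $\phi$.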


1.3. Bayesian classification 


Bayesian classifiers are statistical classifiers based on Bayes' theorem. A
Bayesian classifier generates the probability or membership of an object
with respect to each of the classes. Assume X is an object that is to
be classified or labeled and Y is the hypothesis that X belongs to some
class; then $P(Y = c \mid X)$ is the probability that X belongs to the cth
class. According to Bayes' theorem, this posterior probability of $Y = c$
conditioned on X can be calculated from the likelihood $P(X \mid Y = c)$ and
the prior probabilities $P(X)$ and $P(Y = c)$:

$$P(Y = c \mid X) = \frac{P(X \mid Y = c)\,P(Y = c)}{P(X)} \tag{20}$$


In the above formula, $P(X)$ is constant for any c. If $P(Y = c)$ is unknown,
it is usually assumed that all classes have equal probability, or it is
estimated by the fraction of training objects that belong to class c. The
remaining problem is how to calculate $P(X \mid Y = c)$. To simplify
computation, the values of the attributes are assumed to be conditionally
independent of each other, i.e., given the class label of an object, there
are no dependence relationships among the attributes. Assume $x_j$ is the
value of the jth feature and there are p features in total; then, based on
this assumption,

$$P(X \mid Y = c) = \prod_{j=1}^{p} P(x_j \mid Y = c) \tag{21}$$


and the classifier is called the Naive Bayes Classifier.
If the jth attribute is categorical, then

$$P(x_j \mid Y = c) = \frac{n_{jc}}{n_c} \tag{22}$$

where $n_c$ is the number of objects in class c, and $n_{jc}$ is the number of
objects in class c whose value of the jth attribute is equal to $x_j$.

If the jth attribute is continuous-valued with a probability distribution
g, e.g., the Gaussian distribution with mean $\mu_c$ and standard deviation
$\sigma_c$, then

$$P(x_j \mid Y = c) = g(x_j) = \frac{1}{\sqrt{2\pi}\,\sigma_c}\, e^{-\frac{(x_j - \mu_c)^2}{2\sigma_c^2}} \tag{23}$$

Once the posterior probabilities $P(Y = c \mid X)$ for all c = 1, 2, ..., k are
calculated, X is assigned to the class with the largest posterior probability,
i.e., X is labeled as class l, where $l = \arg\max_c P(Y = c \mid X)$.
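
The categorical case of (20) to (22) can be sketched with simple counting. The example below uses the Weather data of Table 1; the function names and the query object are hypothetical, and no smoothing is applied, so a value never seen in a class gives that class a score of zero. Because $P(X)$ is the same for every class, the sketch compares the unnormalized products $P(Y = c)\prod_j P(x_j \mid Y = c)$.

```python
from collections import Counter, defaultdict

# Weather data of Table 1: (Outlook, Temperature, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def train_naive_bayes(rows):
    """Count class sizes n_c and attribute-value counts n_jc."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(int)   # key: (attribute index, value, class)
    for row in rows:
        for j, value in enumerate(row[:-1]):
            value_counts[(j, value, row[-1])] += 1
    return class_counts, value_counts

def posterior_scores(x, class_counts, value_counts):
    """Unnormalized posteriors P(Y=c) * prod_j P(x_j | Y=c)."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n                                   # prior P(Y = c)
        for j, value in enumerate(x):
            score *= value_counts[(j, value, c)] / n_c    # equation (22)
        scores[c] = score
    return scores

class_counts, value_counts = train_naive_bayes(ROWS)
query = ("Sunny", "Cool", "High", "True")                 # hypothetical object
scores = posterior_scores(query, class_counts, value_counts)
label = max(scores, key=scores.get)
print(scores, label)
```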


1.4. Decision Trees 

A decision tree is a tree structure where each internal node denotes a test 
on an attribute, each branch denotes an outcome of the test, and each leaf 
node represents a class. Once a decision tree has been constructed with 
training data, a new sample is tested against the decision tree from the top 
node to the leaf node which corresponds to the predicted class of the new 
sample. 

Given a set of training objects, a decision tree is built in a top-down,
recursive, divide-and-conquer manner. A critical problem that needs to be
considered in the construction of the tree is how to select the attributes
for testing. Entropy, or equivalently information gain, and the Gini index
are commonly used for attribute selection. The entropy measures the impurity
of the partitions: the smaller the entropy, or the larger the information
gain, the purer the partitions are. Thus, the attribute with the minimum
entropy or highest information gain is chosen as the test attribute for the
current node. Assume the training data consists of n labeled objects
distributed in k classes, where each class c contains $n_c$ objects; then the
expected information needed to classify a given sample is

$$E = -\sum_{c=1}^{k} p_c \log_2 p_c \tag{24}$$

where $p_c$ is the probability that an arbitrary object belongs to class c. It
is estimated by $n_c / n$. For a feature a, which has h distinct values, the
entropy based on the partitioning into h subsets by a is calculated by

$$E(a) = \sum_{j=1}^{h} P(j)\,E(j) \tag{25}$$

where $P(j) = n_j / n$, $n_j$ is the number of objects for which the value of
feature a is equal to j, and

$$E(j) = -\sum_{c=1}^{k} p_{cj} \log_2 p_{cj} \tag{26}$$

is the entropy of the jth value of attribute a, with $p_{cj} = n_{cj} / n_j$,
where $n_{cj}$ is the number of objects of class c among those $n_j$ objects.
The information gain of attribute a is then

$$G(a) = E - E(a) \tag{27}$$

Table 1. Weather Data

id  Outlook   Temperature  Humidity  Windy  Play
 1  Sunny     Hot          High      False  No
 2  Sunny     Hot          High      True   No
 3  Overcast  Hot          High      False  Yes
 4  Rainy     Mild         High      False  Yes
 5  Rainy     Cool         Normal    False  Yes
 6  Rainy     Cool         Normal    True   No
 7  Overcast  Cool         Normal    True   Yes
 8  Sunny     Mild         High      False  No
 9  Sunny     Cool         Normal    False  Yes
10  Rainy     Mild         Normal    False  Yes
11  Sunny     Mild         Normal    True   Yes
12  Overcast  Mild         High      True   Yes
13  Overcast  Hot          Normal    False  Yes
14  Rainy     Mild         High      True   No

Now we use the Weather data in Table 1 as an example to show how
to calculate the entropy of each attribute. This data consists of fourteen
samples described by four attributes: Outlook, Temperature, Humidity and
Windy. These fourteen samples belong to two classes: Play or Not-Play. From
the table, $P(Play = Yes) = \frac{9}{14}$ and $P(Play = No) = \frac{5}{14}$.
So the entropy of Play, i.e., the expected information needed to classify a
sample, is

$$E = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) = 0.940 \tag{28}$$

Now we calculate the entropy of the attribute Outlook. This attribute has
three values, Sunny, Overcast, and Rainy, which occur 5, 4, and 5 times,
respectively, i.e., $P(Sunny) = \frac{5}{14}$, $P(Overcast) = \frac{4}{14}$
and $P(Rainy) = \frac{5}{14}$. Among the five samples for which Outlook is
Sunny, two are Play and three are Not-Play, thus the entropy of Sunny is

$$E(Sunny) = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) = 0.971 \tag{29}$$

Similarly, taking $0\log_2 0 = 0$,

$$E(Overcast) = -\left(\frac{4}{4}\log_2\frac{4}{4} + 0\log_2 0\right) = 0 \tag{30}$$

$$E(Rainy) = -\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right) = 0.971 \tag{31}$$

So the entropy of Outlook is

$$E(Outlook) = P(Sunny)E(Sunny) + P(Overcast)E(Overcast) + P(Rainy)E(Rainy) \tag{34}$$

$$= \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.694 \tag{35}$$

and the information gain of Outlook is

$$Gain(Outlook) = E - E(Outlook) = 0.940 - 0.694 = 0.246 \tag{36}$$

With the same steps, we can calculate the gain of the other three
attributes: Gain(Temperature) = 0.029, Gain(Humidity) = 0.152, and
Gain(Windy) = 0.048. Since Outlook has the largest gain, it is the best
attribute at the current stage and is selected for testing.
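
The calculation above can be reproduced with a short Python sketch of equations (24) to (27) on the Weather data of Table 1. The function and variable names are hypothetical. Computed without intermediate rounding, the gain of Outlook comes out as 0.2467, which rounds to 0.247 rather than 0.246; the small difference arises because (36) subtracts two values that were already rounded to three decimals.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# Weather data of Table 1: (Outlook, Temperature, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def entropy(labels):
    """Equation (24): -sum_c p_c log2 p_c over the classes that occur."""
    n = len(labels)
    return -sum((m / n) * math.log2(m / n) for m in Counter(labels).values())

def information_gain(rows, attr):
    """Equation (27): E - E(a), with E(a) the weighted entropy from (25)."""
    labels = [row[-1] for row in rows]
    split_entropy = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row[-1] for row in rows if row[attr] == value]
        split_entropy += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - split_entropy

gains = {name: information_gain(ROWS, j) for j, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best attribute:", best)
```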

Once the best attribute is decided and represented as an intermediate
node of the tree, branches are added below this node, where each branch
corresponds to a possible value of this attribute. For each value, the
subset of samples having this value of the current attribute is taken as
the input of the next iteration for further splitting. This process
continues until all samples under consideration have the same class label.
A complete decision tree of the Weather data is shown in Fig. 6.

A tree constructed to correctly classify all the training samples may
overfit. Pruning handles the overfitting problem by removing the least
reliable branches. Besides a higher classification accuracy, pruning also
results in a simplified tree, which makes the test process faster. Pruning
performed during the construction of the tree is called pre-pruning; it stops
the construction early, accepting less pure partitions. Pruning can also be
performed by removing branches from a fully grown tree; this type is called
post-pruning. Fig. 7 shows the unpruned and pruned decision trees of
Fisher’s iris data. This dataset consists of 50 samples from each of
three species of Iris (setosa, virginica and versicolor). Four features were
measured from each sample: sepal length (SL), sepal width (SW), petal
length (PL), and petal width (PW).

Fig. 6. The decision tree of the Weather data.

Fig. 7. The decision tree of the Iris data. (a) The unpruned tree. (b) The tree
with pruning.

ID3 is a popular decision tree algorithm proposed by Ross Quinlan,
and C4.5 is an extension of ID3 with improved computing efficiency and
additional functions, including dealing with continuous values, handling
attributes with missing values, and avoiding overfitting. Another
algorithm, Classification and Regression Trees (CART), proposed by Leo
Breiman, produces either classification or regression binary trees,
depending on whether the dependent variable is categorical or numeric,
respectively. Tree-based classification approaches and their applications
in bioinformatics have also been reviewed in the literature.
1.5. Regression models for classification

Besides the previously reviewed supervised learning methods, which
are widely used for classification, regression models may also be used for
classification analysis. Regression methods model the relationship between
a dependent variable and one or more independent variables. Specifically,
regression analyzes how the value of the dependent variable changes when
any one of the independent variables varies while the other independent
variables are held fixed. The dependent variable is the output or response
variable, and the independent variables are the input or explanatory
variables. Next we discuss two regression models, namely Logistic
Regression and Regularized Least Squares, which are frequently used for
classification purposes.


1.5.1. Logistic Regression 

Logistic Regression is a type of binomial regression that predicts the
probability of the outcome of a “yes or no” type trial using the logistic
function. Formally, Logistic Regression models the relation between the
dependent variable $y_i$ and the independent variables
$\mathbf{x}_i = (x_{i1} \ldots x_{ip})^T$ by

$$y_i = \frac{1}{1 + e^{-(\mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i)}} \tag{37}$$

where $\boldsymbol{\beta} = (\beta_1 \ldots \beta_p)^T$ are the regression
coefficients, and $\epsilon_i$ is the error term. Let

$$t = \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n \tag{38}$$

then $y_i = f(t)$, where $f(t)$ is the logistic function

$$f(t) = \frac{1}{1 + e^{-t}} \tag{39}$$

Like a cumulative distribution function, the logistic function maps any
input in the full range from negative infinity to positive infinity to an
output between 0 and 1, i.e., $f(t) \in (0, 1)$ for $t \in (-\infty, \infty)$.
The coefficients are usually estimated by maximum likelihood estimation
with iterative algorithms such as Newton’s method. Once the coefficients are
learned, logistic regression can be used for binary classification, where
the predicted value $\hat{y}_i$ is the probability of being “yes”.
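
The following Python sketch illustrates the logistic function (39) and its use for binary classification with a threshold of 0.5. Several choices here are assumptions made for brevity: the coefficients are fitted by per-sample gradient ascent on the log-likelihood rather than by Newton’s method, a constant feature 1.0 is prepended to every input so that the model has an intercept, and the one-dimensional data are hypothetical.

```python
import math

def logistic(t):
    """Logistic function (39)."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Estimate beta by stochastic gradient ascent on the log-likelihood.

    This is a simple stand-in for Newton's method; y holds 0/1 labels.
    """
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - logistic(sum(b * v for b, v in zip(beta, xi)))
            beta = [b + lr * error * v for b, v in zip(beta, xi)]
    return beta

def predict_proba(beta, x):
    """Predicted probability of the 'yes' class for input x."""
    return logistic(sum(b * v for b, v in zip(beta, x)))

# Hypothetical 1-D inputs with a leading constant feature; labels overlap,
# so the classes are not perfectly separable.
values = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
X = [(1.0, v) for v in values]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit_logistic(X, y)
for v in (1.0, 4.0):
    p = predict_proba(beta, (1.0, v))
    print(v, round(p, 3), "yes" if p > 0.5 else "no")
```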
1.5.2. Regularized Least Squares

Unlike many other regression models, such as Logistic Regression, the
Regularized Least Squares (RLS) method does not require the examples to be
represented explicitly as feature vectors, as it learns the model and makes
predictions with a kernel matrix K, where each entry
$k_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ of K is defined by a certain
kernel function, e.g., the Gaussian kernel in (18). For a dataset with labels
$\mathbf{y} = (y_1\, y_2 \ldots y_n)^T$ and kernel matrix K, RLS finds
coefficients $\mathbf{c} = (c_1\, c_2 \ldots c_n)^T$ that minimize

$$\frac{1}{2}\|\mathbf{y} - K\mathbf{c}\|^2 + \frac{\delta}{2}\,\mathbf{c}^T K \mathbf{c} \tag{40}$$

where the first term is the least squares term and the second term is the
regularization term with weight $\delta$. The solution for $\mathbf{c}$ that
minimizes the above value has a simple closed form:

$$\mathbf{c} = (K + \delta I)^{-1}\mathbf{y} \tag{41}$$

Once $\mathbf{c}$ is obtained, we can use it to predict the label $\hat{y}$ of
a new data object $\mathbf{x}$ by

$$\hat{y} = \mathbf{k}^T (K + \delta I)^{-1}\mathbf{y} \tag{42}$$

where $\mathbf{k}$ is an n-dimensional vector in which each entry $k_i$ is the
value of the kernel function between this object and a training example,
i.e., $k_i = \kappa(\mathbf{x}, \mathbf{x}_i)$.

In real applications, a similarity matrix recording a certain type of
similarity between each pair of examples may be treated as a kernel matrix.
Since a kernel matrix must be positive definite, some preprocessing may be
needed to transform the given similarity matrix into a positive definite
matrix.
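
The closed form (41) and the prediction rule (42) can be sketched in pure Python as follows. The linear solver, the Gaussian kernel with an assumed width `sigma`, the value of `delta`, and the toy data with labels -1 and 1 are all illustrative assumptions; thresholding the prediction at zero to obtain a class is likewise an assumed convention.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def rls_fit(X, y, delta, kernel):
    """Coefficients c = (K + delta I)^{-1} y, as in (41)."""
    n = len(X)
    A = [[kernel(X[i], X[j]) + (delta if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, y)

def rls_predict(x, X, c, kernel):
    """Prediction k^T c, as in (42), with k_i = kernel(x, x_i)."""
    return sum(kernel(x, xi) * ci for xi, ci in zip(X, c))

# Hypothetical training data: two small clusters with labels -1 and 1.
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
y = [-1.0, -1.0, 1.0, 1.0]
c = rls_fit(X, y, delta=0.1, kernel=gaussian_kernel)

print(rls_predict((0.0, 0.5), X, c, gaussian_kernel))   # negative score
print(rls_predict((3.0, 3.5), X, c, gaussian_kernel))   # positive score
```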

1.6. Ensemble classifier 

An ensemble classifier is not a specific type of classifier like those
introduced earlier. Instead, it is a classifier ensemble, which combines or
aggregates the predictions of several individually trained classifiers,
called base classifiers, to produce a final result. A simple ensemble
classifier is illustrated in Fig. 8. Through aggregation, the prediction of
an ensemble classifier is usually more accurate than that of any individual
classifier. An important problem is how to train each of the base
classifiers, since an ensemble makes sense only if the outputs of the base
classifiers differ. To generate disagreements in the predictions, base
classifiers may be trained with different initial weights, different
parameters, different subsets of features, and different portions of the
training set. The two well-known ensemble methods Bagging and Boosting
mainly focus on the last way to train the base classifiers, and the other
well-known method, Random Forest, makes use of the last two ways.

In the Bagging method, each classifier is trained on a random sample of
the training set. More specifically, the set of samples used for training
a base classifier is generated by randomly drawing with replacement from
the training samples. Although each individual classifier may have a
higher test-set error when trained with a subset of the training samples,
the combination of them can produce a lower test-set error than a single
classifier trained with all the training samples. Bagging has been shown to
be effective on “unstable” learning algorithms, such as decision trees and
neural networks, where small changes in the training set result in large
changes in predictions. Unlike Bagging, where the generation of the training
set for one classifier is independent of the other classifiers, in Boosting
the training set used for each base classifier is chosen based on the
performance of the earlier classifiers. Examples that are incorrectly
predicted by previous classifiers are selected more often than those that
were correctly predicted. In this way, Boosting attempts to make subsequent
classifiers better able to predict examples for which the current
ensemble’s performance is poor. Random Forest combines the Bagging idea for
selecting training samples with random selection of features. The selection
of a random subset of features is an example of the random subspace method,
which is especially useful for handling high-dimensional data, e.g., gene
expression data: the original high-dimensional space is projected into
different low-dimensional subspaces so that the problems caused by high
dimensionality are avoided. Although decision trees are often used as base
classifiers in these ensemble methods, other types of classifiers may also
be used to produce base predictions in an ensemble.

Once all the base classifiers are trained, they generate predictions for
new samples to be classified. Voting is a commonly used way to combine
these predictions to give the final class label for the input. Assuming that
the majority of the classifiers make the correct prediction, voting
labels the sample as the class that is predicted by most of the base
classifiers. Instead of weighting the classifiers equally, the aggregation
weight of each base classifier may also be adapted according to its
performance.
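
Bagging with majority voting can be sketched as follows. This is a minimal illustration under stated assumptions: the base classifier is a 1-nearest-neighbor rule on one-dimensional inputs, chosen only because it fits in two lines (decision trees are the more typical choice), and the data, the number of base classifiers, and the random seed are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the label predicted by most of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def nearest_neighbor(train, x):
    """1-NN base classifier: label of the training point closest to x."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def bagging_predict(data, x, n_classifiers=11, seed=0):
    """Train each base classifier on its own bootstrap sample, then vote."""
    rng = random.Random(seed)
    votes = [nearest_neighbor(bootstrap_sample(data, rng), x)
             for _ in range(n_classifiers)]
    return majority_vote(votes)

# Hypothetical labeled 1-D data: (value, class label).
data = [(0.0, "a"), (0.2, "a"), (0.4, "a"),
        (1.6, "b"), (1.8, "b"), (2.0, "b")]

print(bagging_predict(data, 0.1))
print(bagging_predict(data, 1.9))
```

Weighted voting would replace `majority_vote` with a sum of per-classifier weights for each label, for example weights derived from each classifier's accuracy on held-out data.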
Fig. 8. An ensemble of classifiers (diagram labels: Input, Combine, Output).


2. Drug-target interaction prediction 

In this section, we take drug-target interaction prediction as an example
to present a detailed discussion of how classification is used to handle a
specific task in biology. Some background knowledge of drug-target
interaction prediction is given first. After that, recent studies on using
classification for drug-target interaction prediction are discussed.
Finally, experimental studies on benchmark datasets are presented to
evaluate the performance of several classification approaches in
drug-target interaction prediction.

2.1. Background 

Identification of drug-target interactions is an important part of the drug
discovery pipeline. The great advances in molecular medicine and the human
genome project provide more opportunities to discover unknown associations
in the drug-target interaction network. These new interactions may
lead to the discovery of new drugs and are also useful for understanding
the causes of side effects of existing drugs. Since the experimental
determination of drug-target interactions is costly and time-consuming,
in silico prediction has emerged as a potential complement that provides
useful information in an efficient way.

Traditional approaches for this task are generally categorized into
drug-based approaches and target-based approaches. Drug-based approaches
screen candidate drugs, compounds or ligands to predict whether they
interact with a given target, based on the assumption that similar drugs
share the same target. The similarity of two drugs is measured in different
ways with respect to different aspects. Besides comparing drugs according
to their chemical structures, side effects have also been used to measure
the similarity between drugs. Assuming that similar targets bind to the
same ligand, target-based approaches, on the other hand, compare proteins
to predict whether they bind to the given ligand, or whether they are
targets of the given drug or compound. More specifically, for a given drug,
new targets are identified by comparing candidate proteins to the known
targets of this drug with respect to certain descriptors such as amino acid
sequence, binding sites, or the ligands that bind to them. Computational
methods that find new targets for already approved drugs, for the treatment
of new diseases, based on the structural similarity of their binding sites
have been reviewed in the literature. Candidate targets have also been
compared by the chemical similarity of the ligands that bind to them.
Different from these classic drug-based or target-based approaches,
chemogenomics approaches have been proposed to consider the interactions
between drugs and a protein family rather than a single target.

Recently, machine learning approaches have been applied to this task to
explore the whole interaction space. In the supervised bipartite graph
learning approach, the chemical space and the genomic space are mapped into
a unified space so that interacting drugs and targets are close to each
other while non-interacting drugs and targets are far away from each
other. After the mapping function to such a unified space is learned,
the query pair of drug and target is mapped in the same way to
that unified space, and the probability of interaction between them is
given by their closeness in the mapped space. It has been shown that the
combination of supervised learning performed independently on the drug side
and on the target side performs very well. This approach is called the
Bipartite Local Model (BLM). For a query pair of drug and target, a model of
the query drug is learned with a certain classifier based on the
information about its known targets. Then the probability of interaction
between this drug and the query target is predicted with this model. The
same procedure is applied to obtain the probability of interaction between
them from the target side. Finally, an overall probability of interaction
for the query pair is calculated by combining these two probabilities. It
has been reported that the result based on the knowledge of both directions,
i.e., from the drug side and from the target side, is much better than
those based on either single direction. The same idea has been adopted in
two follow-up works. In the first, a semi-supervised approach is used
instead of a supervised approach to learn the local model. In the second,
Laarhoven found that using only a kernel based on the topology of the known
interaction network already gives very good performance, although
combining it with other types of similarities can further improve the
results. Rather than using one type of drug-drug similarity and one type of
target-target similarity, another study uses multiple types of drug-drug
similarities and target-target similarities and combines them as features
to describe each drug-target pair in order to learn a logistic regression
model. Next, we present the details of how the drug-target prediction task
is handled as three types of classification problems.

2.2. A binary classification problem 

A relatively straightforward way to predict whether a given drug-target pair
interacts is to model the task as a binary classification problem, as in the
logistic regression study mentioned above. The key problem is how to extract
a set of features from different biological sources to characterize or
represent each drug-target pair. In that study, this is done in three
steps. First, five drug-drug similarities and three gene-gene similarities
are calculated based on different biological and chemical sources. Then,
the drug and gene similarity measures are combined as features to describe
each drug-target pair, and feature selection is performed to select
important features. Finally, the classifier is trained with the labeled
samples described by the selected features. In this study, logistic
regression is used for classification.

The whole process is shown in Fig. 9. The drug-drug similarity measures
were computed using chemical structure, Ligand, drug side effects,
drug response gene expression profiles, and the Anatomical, Therapeutic
and Chemical (ATC) classification system code. The gene-gene similarity
measures used are based on protein-protein interactions, sequence,
and Gene Ontology (GO). Once all these drug-drug similarities and
target-target similarities are obtained, each feature is constructed based on one
drug-drug similarity and one target-target similarity. Specifically, each
feature is calculated by combining the drug-drug similarities between the query
drug and other drugs with the gene-gene similarities between the query gene and
other target genes across all true drug-target associations. Fifteen features
(five drug-drug similarities times three gene-gene similarities) are constructed
in this way. After feature selection, ten features are finally selected.

Table 2. Comparison of AUC and AUPR for the four datasets

Drug similarity          Gene similarity           AUC     AUPR
All features                                       0.905   0.935
Selected features                                  0.908   0.935
Ligand                   Sequence similarity       0.851   0.867
Ligand                   GO semantic similarity    0.845   0.867
Predicted Side Effect    GO semantic similarity    0.832   0.863
ATC similarity           GO semantic similarity    0.81    0.858
Ligand                   PPI closeness             0.809   0.844
Chemical                 GO semantic similarity    0.805   0.84
ATC similarity           PPI closeness             0.762   0.809
Chemical                 Sequence similarity       0.749   0.763
Predicted Side Effect    PPI closeness             0.729   0.759
Co-expression            Sequence similarity       0.724   0.748

Table 2 shows the results in terms of AUC (area under the ROC curve) and
AUPR (area under the precision-recall curve), two performance evaluation
measures, with all the features, all the selected features and each single
selected feature. Using the ten selected features gives a result comparable to
using all fifteen features, and both are much better than using any single
feature. When the features are used individually, the feature combining Ligand
and sequence similarity gives the best result. Once each drug-target pair is
represented as a vector of these features, the problem of predicting whether
a query pair interacts simply becomes a binary classification problem that
can be solved by many existing classification algorithms, e.g., the logistic
regression used in this study. Rather than developing a good data
representation through aggregation of multiple data sources, some other studies
focus more on the design of new learning algorithms. Next we introduce two
such learning algorithms, namely Bipartite Graph Learning
and the Bipartite Local Model.
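
The feature construction described in this section can be sketched as follows. This is only a minimal sketch under stated assumptions, not the implementation of the reviewed study: the text says that similarities are combined across all true drug-target associations but does not give the rule, so the rule used here (the maximum, over known associations, of the geometric mean of the drug-drug and gene-gene similarity) is an assumed choice, and the names `pair_feature` and `feature_vector` are hypothetical.

```python
import math


def pair_feature(S_drug, S_gene, known_pairs, d, g):
    """Score the query pair (d, g) against the known associations.

    ASSUMPTION: the combination rule is the maximum geometric mean of the
    drug-drug and gene-gene similarity over all known pairs; the chapter
    does not state the rule explicitly.
    """
    best = 0.0
    for d2, g2 in known_pairs:
        if (d2, g2) == (d, g):
            continue  # the query pair must not score itself
        best = max(best, math.sqrt(S_drug[d][d2] * S_gene[g][g2]))
    return best


def feature_vector(drug_sims, gene_sims, known_pairs, d, g):
    """One feature per (drug similarity, gene similarity) combination."""
    return [
        pair_feature(S_drug, S_gene, known_pairs, d, g)
        for S_drug in drug_sims
        for S_gene in gene_sims
    ]
```

With five drug-drug similarity matrices and three gene-gene similarity matrices, `feature_vector` returns the fifteen features mentioned above; the resulting vectors can then be passed to any binary classifier.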

2.3. Bipartite graph learning (BGM) 

We assume that the problem under consideration is to predict new interactions
between $n_d$ drugs and $n_t$ targets. An $n_d \times n_t$ matrix $A$ is used to
record the known interactions, i.e., $a_{ij} = 1$ if the $i$th drug, denoted as
$d_i$, is known to interact with the $j$th target, denoted as $t_j$. All other
entries of $A$ are 0. Assume the $n_i$ interactions in total involve $m_d$ drugs
and $m_t$ targets, with $m_d < n_d$ and $m_t < n_t$. This means there are some
new drug and target candidates, and the corresponding rows and columns of $A$
are all 0. Other than the interaction network, $S^d$ and $S^t$ are the chemical
similarity matrix of drugs and the sequence similarity matrix of targets,
respectively.

Fig. 9. Algorithm pipeline. (A) comprised of formation of drug-drug and gene-gene
similarity matrices, (B) integration of the similarities to classification features,
(C) classification with feature selection.

The bipartite graph learning method learns the correlation between the
chemical/genomic space and the interaction space, which is called the
'pharmacological space'. As illustrated in Fig. 10, the compounds and
proteins are first embedded into a unified space called 'pharmacological space'.
Then the mapping function or model between the chemical/genomic space and
the pharmacological space is learned. With this model, any query pair of
compound and protein is mapped onto the same pharmacological space.
The compound-protein pair under testing is predicted to be interacting
if the two are closer than a threshold in the pharmacological space. The
whole process consists of the following steps:


• Step 1: construct a graph-based similarity matrix

$$ K = \begin{pmatrix} K_{cc} & K_{cg} \\ K_{cg}^{T} & K_{gg} \end{pmatrix} \tag{43} $$

where the entries of each matrix are calculated as

$$ (K_{cc})_{ij} = \exp\left(-\frac{d_{c_i c_j}^{2}}{h^{2}}\right) \tag{44} $$

$$ (K_{gg})_{ij} = \exp\left(-\frac{d_{g_i g_j}^{2}}{h^{2}}\right) \tag{45} $$

$$ (K_{cg})_{ij} = \exp\left(-\frac{d_{c_i g_j}^{2}}{h^{2}}\right) \tag{46} $$

where $d$ is the shortest distance between two objects (compounds
or proteins) on the bipartite graph and $h$ is a width parameter. The
symmetric matrix $K$ has a scale of $(n_c + n_g) \times (n_c + n_g)$. After
$K$ is constructed, eigenvalue decomposition is performed on $K$ to get $U$:

$$ K = \Gamma \Lambda^{1/2} \Lambda^{1/2} \Gamma^{T} = U U^{T} \tag{48} $$

where $\Lambda$ is the diagonal matrix whose diagonal elements are
the eigenvalues and the columns of matrix $\Gamma$ are the corresponding
eigenvectors. Write $U$ with its row vectors:
$U = (u_{c_1}, \ldots, u_{c_{n_c}}, u_{g_1}, \ldots, u_{g_{n_g}})^{T}$.

• Step 2: For $i = 1, \ldots, n_c$ and $j = 1, \ldots, n_g$, learn $w_{c_i}$
and $w_{g_j}$ by assuming the following relation, which is a variant of the
kernel regression model:

$$ u = \sum_{i=1}^{n_c} s_c(x, x_{c_i})\, w_{c_i} + \epsilon \tag{49} $$

$$ u = \sum_{j=1}^{n_g} s_g(x, x_{g_j})\, w_{g_j} + \epsilon \tag{50} $$

• Step 3: map the query compound $c_q$ and protein $g_q$ with the
learned $w_{c_i}$ and $w_{g_j}$:

$$ u_{c_q} = \sum_{i=1}^{n_c} s_c(c_q, c_i)\, w_{c_i} \tag{52} $$

$$ u_{g_q} = \sum_{j=1}^{n_g} s_g(g_q, g_j)\, w_{g_j} \tag{53} $$

• Step 4: The score of interaction between $c_q$ and $g_q$, denoted as
$p_{c_q, g_q}$, is calculated as the inner product of the feature vectors in
the mapped space

$$ p_{c_q, g_q} = \langle u_{c_q}, u_{g_q} \rangle \tag{55} $$
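
Step 1 and Step 4 above can be sketched in a few lines. This is a minimal illustration rather than the original implementation: it assumes the kernel has the Gaussian form $\exp(-d^2/h^2)$ with a width parameter `h`, treats unreachable node pairs as infinitely distant (kernel value 0), and omits the eigendecomposition and kernel regression of Steps 1 to 3. The function names are hypothetical.

```python
import math
from collections import deque


def bipartite_distances(A):
    """All-pairs shortest path lengths on the bipartite interaction graph.

    Nodes 0..n_c-1 are compounds and nodes n_c..n_c+n_g-1 are proteins.
    """
    n_c, n_g = len(A), len(A[0])
    n = n_c + n_g
    adj = [[] for _ in range(n)]
    for i in range(n_c):
        for j in range(n_g):
            if A[i][j]:
                adj[i].append(n_c + j)
                adj[n_c + j].append(i)
    dist = []
    for src in range(n):
        d = [math.inf] * n
        d[src] = 0
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if d[v] == math.inf:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist.append(d)
    return dist


def graph_kernel(A, h=1.0):
    """Graph-based similarity matrix K; ASSUMES the form exp(-d^2 / h^2)."""
    return [[math.exp(-(d * d) / (h * h)) for d in row]
            for row in bipartite_distances(A)]


def interaction_score(u_c, u_g):
    """Step 4: inner product of the two mapped feature vectors."""
    return sum(a * b for a, b in zip(u_c, u_g))
```

The upper-left $n_c \times n_c$ block of the returned matrix corresponds to $K_{cc}$, the lower-right block to $K_{gg}$, and the off-diagonal blocks to $K_{cg}$ and its transpose.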


2.4. Bipartite local model (BLM) 

To predict $p_{ij}$, the probability that a drug $d_i$ and a target $t_j$
interact, the basic bipartite local model is described as follows. A local model
of $d_i$ is first learned based on the known targets of this drug and the
similarities between these targets. This model is then used to predict
$p_{ij}^{d}$, the probability of interaction between this drug and the tested
protein. The model learning and prediction process is performed independently
from the query target side to get $p_{ij}^{t}$. Once both $p_{ij}^{d}$ and
$p_{ij}^{t}$ are calculated, they are combined with some function $f$ to get
the final result $p_{ij} = f(p_{ij}^{d}, p_{ij}^{t})$. Fig. 11
illustrates the idea of drug-target interaction prediction with learning from
the drug and target independently.

This framework was first proposed in the study referred to below as BY(2009),
and was then further studied in later work, including Laarhoven et al. (2011).
Under the same BLM framework, different results may be produced due to
differences in the drug-drug similarity $S^d$ and the target-target
similarity $S^t$, the classifier, and the way $p_{ij}^{d}$ and $p_{ij}^{t}$ are
combined, i.e., the function $f$. For example, in BY(2009), a Support Vector
Machine (SVM) is used as the classifier, with chemical structure similarity for
drugs and sequence similarity for protein targets. The same types of
similarity data are used in another study, but with a semi-supervised approach
for local model learning. In Laarhoven et al. (2011), network topology based
similarities for drugs and targets are calculated and combined with the chemical
structure similarity and sequence similarity, respectively, to give the final
pairwise drug similarities and pairwise target similarities, and Regularized
Least Squares (RLS) is used for model learning. So far, simple combination
functions have been shown to be good enough to get the final prediction from
the two individually obtained ones, e.g., both
$p_{ij} = \max\{p_{ij}^{d}, p_{ij}^{t}\}$ and
$p_{ij} = 0.5\,(p_{ij}^{d} + p_{ij}^{t})$ have been used.
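
The BLM procedure can be illustrated with a small, self-contained sketch. It assumes Regularized Least Squares as the local classifier, with scores computed as $K (K + \sigma I)^{-1} y$ for a similarity (kernel) matrix $K$ and label vector $y$; the regularization value `sigma`, the hiding of the query entry before training, and all function names are assumptions made for illustration, not the settings of any of the cited studies.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def rls_scores(K, y, sigma=1.0):
    """Regularized least squares scores K (K + sigma I)^{-1} y."""
    n = len(y)
    reg = [[K[r][c] + (sigma if r == c else 0.0) for c in range(n)]
           for r in range(n)]
    alpha = solve(reg, y)
    return [sum(K[r][c] * alpha[c] for c in range(n)) for r in range(n)]


def blm_predict(A, S_d, S_t, i, j, combine=max, sigma=1.0):
    """Score the pair (d_i, t_j) from both sides and combine the two scores."""
    n_d, n_t = len(A), len(A[0])
    # Drug side: local model of d_i over all targets, kernel S_t.
    y_drug = [float(A[i][k]) for k in range(n_t)]
    y_drug[j] = 0.0  # hide the query entry (assumption, as in leave-one-out)
    p_d = rls_scores(S_t, y_drug, sigma)[j]
    # Target side: local model of t_j over all drugs, kernel S_d.
    y_target = [float(A[h][j]) for h in range(n_d)]
    y_target[i] = 0.0
    p_t = rls_scores(S_d, y_target, sigma)[i]
    return combine(p_d, p_t)
```

Passing `combine=max` gives the maximum rule, and `combine=lambda a, b: 0.5 * (a + b)` gives the averaging rule. Note that when $d_i$ has no other known target, `y_drug` is all zeros and the drug-side score is 0, which is the limitation addressed in the next section.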


2.5. Enhanced BLM with training data inferring for new 
drug/target candidates 

Generally, supervised learning performs better than unsupervised learning.
However, good performance of supervised learning depends largely
on the amount and quality of the labeled training data. When a drug
candidate is new, it has no known targets that can be used as positively
labeled training data, and the model for this drug thus cannot be learned.
Similarly, supervised local model learning does not work for new target
candidates. To extend the application domain of BLM to new drug and
target candidates, in our earlier work we present a training data inferring
procedure and integrate it into BLM; the resulting method is referred to as
BLMN. Based on the assumption that drugs which are similar to each other
interact with the same targets, training data for a new drug candidate can be
inferred from its neighbors. The neighbors of a new drug candidate generally
refer to those drugs that share some similar properties with it, e.g., a
similar chemical structure.
Fig. 11. Drug-target interaction prediction with learning from the drug and target
independently. Given the drug-drug similarity $S^d$, the target-target similarity
$S^t$ and the drug-target interactions $A$, the interaction between $d_i$ and
$t_j$ is predicted from the drug candidate $d_i$ and from the target candidate
$t_j$, and the two predictions are combined into the final probability of
interaction between $d_i$ and $t_j$.

For a drug candidate $d_i$ that has no known targets, we infer the weighted
interaction profile for $d_i$ with the following formula



$$ l^{(i)} = (s_i^{d})^{T} A \tag{56} $$

where each dimension

$$ l_j^{(i)} = \sum_{h=1}^{n_d} s_{ih}^{d}\, a_{hj} \tag{57} $$

Here vector $s_i^{d}$ is the $i$th column of $S^d$, which records the
similarities between $d_i$ and all the other drugs, $s_{ih}^{d}$ is the
similarity between two drugs $d_i$ and $d_h$, and vector $l^{(i)}$ is the
inferred interaction profile for $d_i$, where each dimension $l_j^{(i)}$
corresponds to the weight of the interaction between $d_i$ and $t_j$. The above
formula shows that the interaction weight of $d_i$ with respect to the $j$th
target is the sum of the interactions between its neighbors and this target,
weighted by the similarity between this drug and its neighbors. More
specifically, for a given new drug candidate $d_i$, its weight of interaction
with respect to a target is high if many of its neighbors interact with this
target, and the final weight is influenced more by a neighbor with a larger
similarity than by those with smaller similarities. To allow only neighbors
with large similarities to contribute, a threshold may be used to reduce the
impact of the non-important neighbors to 0. Alternatively, an exponential
function with bandwidth $\beta$, given below, may be introduced:

1(f) = (58) 

To ensure that the value of each $l_j^{(i)}$ is in the range [0, 1], linear
scaling is performed subsequently. The procedure of inferring training data for
new target candidates is not discussed in detail here, as it is similar to the
procedure of inferring training data for new drug candidates presented
above.
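
The weighted-sum inference of Eqs. (56) and (57), together with the optional threshold, can be sketched as follows. This is a minimal sketch: the function name `infer_profile` is hypothetical, and dividing by the largest weight is only one assumed way of performing the linear scaling into [0, 1], since the text does not specify the scaling.

```python
def infer_profile(S_d, A, i, threshold=0.0):
    """Infer the interaction profile of drug i from its neighbors.

    S_d is the drug-drug similarity matrix and A the drug-target interaction
    matrix. Neighbors whose similarity to drug i is below `threshold`
    contribute nothing.
    """
    n_d, n_t = len(A), len(A[0])
    weights = [S_d[h][i] if S_d[h][i] >= threshold else 0.0
               for h in range(n_d)]
    # Eq. (57): similarity-weighted sum of the neighbors' interactions.
    raw = [sum(weights[h] * A[h][j] for h in range(n_d)) for j in range(n_t)]
    top = max(raw)
    # ASSUMPTION: linear scaling is done by dividing by the largest weight.
    return [v / top for v in raw] if top > 0 else [0.0] * n_t
```

For a drug with an all-zero row in `A`, the returned profile can serve as the label vector for training its local model in place of the missing known interactions.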

Learning from neighbors allows drugs and targets to obtain training
data when they themselves do not have any known interactions. This procedure
introduces some degree of globalization into the original local
model, giving the learning process an enlarged scope.
However, too much globalization is not desired, as it weakens the local
characteristics and makes the models for each drug or target less
discriminative. Moreover, low-quality neighbors may add noise and have
a negative impact when the neighbors' preferences are relied upon too heavily.
In the current study, we only activate the neighbor-based training data
inferring for totally new candidates. For other cases, we still train the model
locally on the candidate's own preference, i.e., its known interactions.

3. Experimental study 

Now we give some experimental results to compare the performance of the
BGM method, the BLM method and the BLMN method (BLM with neighbor-based
training data inferring) for the task of drug-target interaction prediction.
From the experimental results, we have the following observations: first,
BLM-based approaches outperform BGM; second, with neighbor-based training data
inferring, BLMN performs better than the classic BLM; third, network topology
based similarity helps to improve the prediction.

Table 3. Some statistics of the four datasets. n_d: the total number of drugs,
n_t: the total number of targets, E: the total number of interactions,
D_d: the average number of targets for each drug, D_t: the average number of
targeting drugs for each target, D_d = 1: the percentage of drugs that have
only one target, and D_t = 1: the percentage of targets that have one
targeting drug.

Dataset        Enzyme   Ion Channel   GPCR    Nuclear Receptor
n_d            445      210           223     54
n_t            664      204           95      26
E              2926     1476          635     90
D_d            6.58     7.03          2.85    1.67
D_t            4.41     7.24          6.68    3.46
D_d = 1 (%)    39.78    38.57         47.53   72.22
D_t = 1 (%)    43.37    11.27         35.79   30.77
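
The statistics in Table 3 follow directly from the interaction matrix. The sketch below shows one way to compute them; the function name `dataset_stats` and the dictionary keys are hypothetical, and the percentages are taken over all drugs (or all targets), which is consistent with the reported values (for example, 2926 / 445 ≈ 6.58 for the Enzyme dataset).

```python
def dataset_stats(A):
    """Compute the Table 3 statistics from a 0/1 drug-target matrix A."""
    n_d, n_t = len(A), len(A[0])
    drug_deg = [sum(row) for row in A]  # targets per drug
    target_deg = [sum(A[h][j] for h in range(n_d)) for j in range(n_t)]
    E = sum(drug_deg)  # total number of interactions
    return {
        "n_d": n_d,
        "n_t": n_t,
        "E": E,
        "D_d": E / n_d,  # average number of targets per drug
        "D_t": E / n_t,  # average number of drugs per target
        "D_d=1(%)": 100.0 * sum(1 for k in drug_deg if k == 1) / n_d,
        "D_t=1(%)": 100.0 * sum(1 for k in target_deg if k == 1) / n_t,
    }
```

Drugs and targets counted under `D_d=1(%)` and `D_t=1(%)` are the ones that lose their only positive example when that interaction is left out during evaluation.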

3.1. Datasets

The four groups of datasets were first analysed in earlier work and then later
by several other researchers. The datasets were downloaded from
http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. These four datasets
correspond to drug-target interactions of four important categories of protein
targets, namely enzyme, ion channel, G-protein-coupled receptor (GPCR) and
nuclear receptor, respectively. Table 3 gives some statistics of each of the
datasets.

Each dataset is described by three types of information in the form of
three matrices: the drug-target interaction information, the drug-drug
similarity, and the target-target similarity. The four interaction networks
were retrieved from four databases: KEGG BRITE, BRENDA, SuperTarget and
DrugBank. The drug-drug similarity is measured based on chemical structures
from the DRUG and COMPOUND sections of the KEGG LIGAND database and is
calculated with SIMCOMP. The target-target similarity is measured based on the
amino acid sequences from the KEGG GENES database and is calculated with
a normalized version of the Smith-Waterman score.

3.2. Approaches compared

We compare the following approaches:

• BGM: Bipartite graph model;

• BY(2009): Bipartite local model;

• Laarhoven et al. (2011): Bipartite local model with network-based
similarity;

• BLM: BLMN without the handling of new candidates;

• BLMN: BLM with neighbor-based training data inferring.

Among the above methods, BGM requires eigendecomposition of an
$(n_c + n_g) \times (n_c + n_g)$ matrix, which is computationally expensive for
large datasets. BY(2009), Laarhoven et al. (2011) and BLM are three variants of
the classic BLM method, which is not applicable to new candidates. BLMN
is the modified BLM method, which can be used to predict the interaction
between any compounds and proteins.

3.3. Evaluation 

Leave-one-out cross validation (LOOCV) is performed. In each run of
prediction, one drug-target pair is left out by setting the corresponding entry
of matrix $A$ to 0. Then we try to recover its true value using the remaining
data. We measure the quality of the predicted interaction matrix $P$ by
comparing it to the true interaction matrix $A$ in terms of the area under
the ROC curve, i.e., the true positive rate (TPR) vs. false positive rate (FPR)
curve (AUC), and the area under the precision vs. recall curve (AUPR). TPR is
equivalent to recall. Let TP, FP, TN and FN denote the numbers of true
positives, false positives, true negatives and false negatives, respectively;
then

$$ \mathrm{TPR} = \mathrm{recall} = \frac{TP}{TP + FN} \tag{59} $$

$$ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{60} $$

$$ \mathrm{precision} = \frac{TP}{TP + FP} \tag{61} $$

Since in the current task the known interactions are far fewer than the
unknown ones, the precision-recall curve should be a better measurement
than the ROC curve here, as has been discussed in the literature.
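
Eqs. (59) to (61) and the AUC can be computed with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, a pair is predicted positive when its score is at least the threshold, and the AUC is computed through the equivalent pairwise-ranking formulation (the fraction of positive-negative pairs in which the positive is scored higher, with ties counted as one half) rather than by integrating the ROC curve explicitly.

```python
def confusion(labels, scores, threshold):
    """Count TP, FP, TN, FN when score >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(labels, scores, threshold):
    """Return (TPR/recall, FPR, precision) following Eqs. (59)-(61)."""
    tp, fp, tn, fn = confusion(labels, scores, threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # Convention: precision is taken as 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tpr, fpr, precision


def auc_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping the threshold over all observed scores and collecting the (recall, precision) points from `rates` gives the precision-recall curve whose area is the AUPR.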

3.4. Performance comparison 

Table 4 gives the AUC and AUPR scores of the five approaches on the four
datasets. The results of BGM, BY(2009), and Laarhoven et al. (2011) are
the best ones reported in the corresponding papers. Both BLMN and BLM are run
with three different groups of inputs: Chem-Seq, Network-based, and Hybrid.
Chem-Seq denotes that chemical similarity is used for drugs and sequence
similarity is used for targets; Network-based denotes that the drug-drug
similarity and target-target similarity are derived from the existing
interaction network; Hybrid denotes that the drug-drug similarity and
target-target similarity are combinations of the two types of similarities.

The table shows that, with a low time complexity, the four BLM-based
approaches, including the three BLM variants and BLMN, generally produce better
results than the BGM method. Among the three BLM variants, the results
of BLM and BY(2009) with Chem-Seq are similar, as the only difference
between them is that the former uses RLS as the classifier while the latter
uses SVM. The results of BLM and Laarhoven et al. (2011) with Network-based
similarity are also close in most cases, although the latter used the Kronecker
product, which is a more complicated way to combine two types of similarities.
In all cases, BLMN produced better results than the three classic BLM
algorithms. This clearly shows that neighbor-based training data inferring
is very useful for improving the final result when the dataset contains new
drug/target candidates.

Despite the consistent improvements of BLMN over the other
three on all four datasets, the amount of improvement differs across
datasets. The improvement with respect to BLM on Nuclear Receptor is the most
significant, while the improvements on Enzyme and Ion Channel are less
significant. Such differences in the performance of the proposed approach are
consistent with our expectation given the differences in the structure of the
datasets. Although none of the datasets contains new drug/target candidates, in
our experiment the real interaction to be predicted is left out. This means that
drugs and targets with degree equal to 1 have no positive training data and are
thus simulated to be "new" in the experiments. As shown in Table 3,
Nuclear Receptor has a much larger portion of "new" drugs and targets
than Ion Channel. Therefore, BLMN has more chances to improve
the results for Nuclear Receptor, where the training data inferring is applied
more frequently.

It is also observed that although network-derived similarity alone provides
good information, combining it with biological information can further improve
the result, especially when the network is sparse. For example, the results
of both BLM and BLMN for Ion Channel with only Network-based similarity are
very close to those with Hybrid, while significant improvements are achieved
for both approaches on Nuclear Receptor when Chem-Seq is further combined
with Network-based similarity. This shows that combining multiple types
of similarities usually gives better results when no single type of similarity
is good enough.

March 13, 2015 1:18    World Scientific Review Volume - 9in x 6in    classification'app

Table 4. Comparison of AUC and AUPR for the four datasets

Dataset            Data            Method                     AUC    AUPR
Enzyme             Chem-Seq        BGM                        96.7   83.1
                                   BY(2009)                   97.6   83.3
                                   BLM                        96.1   85.8
                                   BLMN                       98.0   87.3
                   Network-based   Laarhoven et al (2011)     98.3   88.5
                                   BLM                        98.2   88.0
                                   BLMN                       99.1   93.1
                   Hybrid          Laarhoven et al (2011)     97.8   91.5
                                   BLM                        98.2   91.3
                                   BLMN                       98.8   92.9
Ion Channel        Chem-Seq        BGM                        96.9   77.8
                                   BY(2009)                   97.3   78.1
                                   BLM                        97.0   81.9
                                   BLMN                       97.8   84.6
                   Network-based   Laarhoven et al (2011)     98.6   92.7
                                   BLM                        98.5   92.5
                                   BLMN                       99.0   95.6
                   Hybrid          Laarhoven et al (2011)     98.4   94.3
                                   BLM                        98.5   92.7
                                   BLMN                       99.0   95.0
GPCR               Chem-Seq        BGM                        94.7   66.4
                                   BY(2009)                   95.5   66.7
                                   BLM                        95.1   68.1
                                   BLMN                       98.1   78.8
                   Network-based   Laarhoven et al (2011)     94.7   71.3
                                   BLM                        94.4   70.6
                                   BLMN                       97.5   84.6
                   Hybrid          Laarhoven et al (2011)     95.4   79.0
                                   BLM                        95.7   76.2
                                   BLMN                       98.4   86.5
Nuclear Receptor   Chem-Seq        BGM                        86.7   61.0
                                   BY(2009)                   88.1   61.2
                                   BLM                        86.9   58.4
                                   BLMN                       96.9   80.7
                   Network-based   Laarhoven et al (2011)     90.6   61.0
                                   BLM                        90.9   62.9
                                   BLMN                       95.7   80.7
                   Hybrid          Laarhoven et al (2011)     92.2   68.4
                                   BLM                        94.0   72.4
                                   BLMN                       98.1   86.6

4. Summary

Classification is an important data analysis tool that has been studied
extensively. Many computational biology tasks are binary classification
problems that predict whether the outcome of a trial is positive or negative.
We have introduced several popular supervised learning methods for
classification, including popular classification methods, regression models
used for classification, and ensemble classification. We have given a more
detailed discussion of how different classification methods can be used for
drug-target interaction prediction. Experimental studies are given to compare
the performance of different approaches on benchmark datasets.

Other than the specific learning method, the classification result is also
highly dependent on the amount and quality of the given training data and
the way the data are represented, e.g., as a set of features or similarity
measures. Given the same set of training data, a good data representation with
a simple classifier may already produce a good result. Nevertheless, with
the same data representation, an advanced classification algorithm is able
to make use of it more effectively and hence produce a better result. This
chapter focuses on algorithm design.